  • cluster ward | Insufficient memory for ClusterMatrix | r(950)

    Dear Statalist,
    I have a sample of about 38,000 observations and 9 variables. I want to perform a Ward's linkage cluster analysis. However, whenever I try to execute the "cluster ward" command in Stata, I get the following message:


    insufficient memory for ClusterMatrix
    r(950);


    I have also tried running the analysis on a server with more than 128 GB of RAM, but I always get the same error message.


    How can I solve this issue? More information about my problem is given below.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(var1 var2 var3 var4 var5 var6 var7 var8) byte gender
           0        0 0   0   1        0        0   1 1
    .3333333 .6666667 0   0   0 .3333333 .3333333   1 0
           0       .5 0  .5   0        0        1   1 1
           0        1 0   0   0        0        0   1 1
           0        1 0   0   0        0 .3333333   1 1
         .25      .75 0   0   0      .25      .25   1 0
          .5       .5 0   0   0       .5        1   1 1
           0       .5 0  .5   0       .5       .5   1 0
         .25       .5 0   0 .25      .25      .25   1 1
           0        0 0   1   0        1        0   1 1
           0        1 0   0   0        0        0   0 1
          .2       .6 0  .2   0       .2       .2   1 1
          .5       .5 0   0   0       .5       .5   1 1
           1        0 0   0   0        1        1   1 0
           0       .5 0  .5   0        0        0   1 1
           1        0 0   0   0        1        1   1 0
           0      .75 0 .25   0        0      .25 .75 1
    .3333333 .6666667 0   0   0 .6666667 .3333333   1 1
           0        0 1   0   0        0        0   1 0
           0        1 0   0   0        0        1   1 1
    end
    label values gender gender
    label def gender 0 "Men", modify
    label def gender 1 "Women", modify
    
    *>> Cluster analysis (Ward method)
    cluster ward     ///
    var1            ///
    var2            ///
    var3            ///
    var4            ///
    var5            ///
    var6            ///
    var7            ///
    var8            ///
    if gender==1, name(my_cluster_women)
    Last edited by Matthew Campbell; 10 Mar 2022, 08:43.

  • #2
    That is a big ask for a problem that entails pairwise comparisons.

    I would look for clusters in graphs of leading principal components on var?. Alternatively, it could be that you have many duplicates and can slim down the dataset to one with distinct observations. Classification isn't affected by the presence of duplicates, as I understand it.
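
    Both suggestions could be sketched roughly as follows, assuming the full dataset (var1-var8 and gender, as in the -dataex- example) is in memory. The new variable names pc1 and pc2 are only illustrative, and the choice of two components is an assumption, not something the data dictate.

    ```stata
    * Option 1: plot the two leading principal components for women
    pca var1-var8 if gender == 1, components(2)
    predict pc1 pc2 if e(sample), score
    scatter pc2 pc1 if e(sample)

    * Option 2: see how many observations are duplicates on var1-var8
    duplicates report var1-var8 if gender == 1

    * Keep one row per distinct pattern in a temporary copy and count them
    preserve
    keep if gender == 1
    duplicates drop var1-var8, force
    count
    restore
    ```

    If the count of distinct patterns turns out to be much smaller than the number of observations, the reduced dataset may be small enough for a dissimilarity matrix to fit in memory.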

    • #3
      The full PDF documentation for the cluster linkage command tells us:

          Technical note

          cluster commands require a significant amount of memory and execution time. With many observations, the execution time may be significant.

      38,000 observations qualifies as "many", probably even after you reduce that to those with gender==1.

      And the PDF documentation for the overview of the cluster analysis commands tells us:

          The first step of an agglomerative algorithm considers N(N-1)/2 possible fusions of observations to find the closest pair. This number grows quadratically with N.

      I'm of the opinion that hierarchical cluster analysis is out of the question for a problem of the size you present.
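
      To put a number on the N(N-1)/2 figure quoted above, here is a small sketch. The first line uses the 38,000 observations mentioned in the question; the second part assumes the data are in memory and uses however many observations have gender==1, a count not given in the thread.

      ```stata
      * Pairs considered at the first agglomeration step for N = 38,000
      display %15.0fc 38000*(38000 - 1)/2

      * Same calculation for the women-only subsample
      count if gender == 1
      display %15.0fc r(N)*(r(N) - 1)/2
      ```

      For the full sample the first line works out to 721,981,000 pairs.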

      • #4
        Dear Nick Cox and William Lisowski,
        Thanks a lot for your answers. From what I understand, it is better to leave hierarchical cluster analysis out. For this reason I have opted for a partitioning method (more precisely, k-means), as it is less computationally demanding...
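
        For readers with the same problem, a minimal sketch of what such a k-means run might look like, assuming the same variables as in the first post. The number of groups k(4), the cluster name km_women, and the seed are placeholders chosen for illustration, not values from this thread.

        ```stata
        * k-means on the women-only subsample, with random starting centers
        * and a fixed seed so that the run can be reproduced
        cluster kmeans var1-var8 if gender == 1, k(4) name(km_women) start(krandom(12345))

        * Group sizes (the grouping variable takes the cluster name by default)
        tabulate km_women
        ```

        k-means avoids the pairwise dissimilarity matrix that the linkage methods need, which is why it scales to samples of this size. The result depends on k and on the starting centers, so it is worth trying several values of each.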
